1. Linear Regression
    1. Assumption
    2. Accuracy metrics
    3. Regularization
2. Non-Linear Regression
    1. Logistic Regression

Linear Regression¶

Assumption¶

In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.append(r'../../')
from utils.plots import plot_linear_plot
import plotly
plotly.offline.init_notebook_mode()

# Fix the random seed so the noise below is reproducible (np.random.seed returns None).
np.random.seed(42)
In [2]:
# Ground-truth slope, intercept, and maximum noise amplitude.
m = 10
b = 2
noise_level = 50

x = np.arange(start=0, stop=10, step=1)
noise = np.random.randint(low=0, high=noise_level, size=len(x))

# Noisy observations of the underlying line y = m*x + b.
y = m * x + b + noise
In [3]:
# first order polynomial - simple linear regression
coeff = np.polyfit(x=x, y=y, deg=1)
y_pred = (coeff[0] * x**1) + coeff[1]
plot_linear_plot(x, y, y_pred)
In [4]:
# third order polynomial
coeff = np.polyfit(x=x, y=y, deg=3)
y_pred = (coeff[0] * x**3) + (coeff[1] * x**2) + (coeff[2] * x**1) + coeff[3]

plot_linear_plot(x, y, y_pred)
In [5]:
# eighth order polynomial, evaluated with numpy's polyval helper;
# with only 10 points such a high degree starts to chase the noise (overfitting)
coeff = np.polyfit(x=x, y=y, deg=8)
y_pred = np.round(np.polyval(coeff, x), 2)
plot_linear_plot(x, y, y_pred)

Accuracy metrics¶

  1. MAE (Mean Absolute Error)

    • Pros:
      • The MAE is expressed in the same unit as the output variable.
      • Robust to outliers.
    • Cons:
      • Not differentiable at zero, which makes it harder to use directly as a loss function with gradient-based optimizers.
  2. MSE (Mean Squared Error)

    • Pros:
      • Differentiable and can be used as a loss function.
    • Cons:
      • Output is in squared units.
      • Not robust to outliers due to squared differences.
  3. RMSE (Root Mean Squared Error)

    • Pros:
      • Output is in the same unit as the target variable.
  4. RMSLE (Root Mean Squared Log Error)

    • Pros:
      • The logarithm dampens the penalty for large errors, so big absolute mistakes on large values are not punished as harshly.
      • Penalizes underestimation more than overestimation, which is useful when under-prediction is the more costly mistake.
    • Cons:
      • The heavy, asymmetric penalty on underestimation can be undesirable for other problems, and the metric is undefined for negative values.
  5. MAPE (Mean Absolute Percentage Error)

    • Pros:
      • Expresses the error as a percentage, so errors on high- and low-magnitude targets are directly comparable.
    • Cons:
      • Sensitive to outliers and undefined when actual values are zero.
  6. R2 (R-Squared) $$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

    • Pros:
      • Compares regression line to mean line.
      • Useful for model comparison.
      • For ordinary least squares on the training data it lies between 0 and 1 (1 being best); it measures how much of the target's variance the model explains.
    • Cons:
      • Never decreases when features are added, even useless ones, so it can reward needless complexity.
  7. Adj R2 (Adjusted R-Squared) $$\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \cdot \frac{n - 1}{n - k - 1}$$ where $n$ is the number of samples and $k$ the number of predictors.


    • Pros:
      • More reliable than R2 for comparing models with different numbers of features (see the computation sketch after this list).
      • Decreases when irrelevant features are added, since it penalizes model size.
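
As a rough illustration, the cell below computes these metrics for the y and y_pred arrays left over from the fits above, using NumPy and scikit-learn; using k = 1 predictor in the adjusted R2 line is an illustrative assumption (k should match the number of predictors in the model actually being scored).

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes y and y_pred come from the polynomial-fit cells above.
mae = mean_absolute_error(y, y_pred)                             # same unit as y
mse = mean_squared_error(y, y_pred)                              # squared units
rmse = np.sqrt(mse)                                              # back in the unit of y
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2))  # needs non-negative values
mape = np.mean(np.abs((y - y_pred) / y)) * 100                   # undefined if any y == 0
r2 = r2_score(y, y_pred)

# Adjusted R^2 with n samples and k predictors (k = 1 here only for illustration).
n, k = len(y), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mse, rmse, rmsle, mape, r2, adj_r2)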


Regularization¶

Regularization is important for managing the bias-variance trade-off, i.e., keeping the model between underfitting and overfitting.

  1. Bias-Variance Trade-off:

    • Choosing the polynomial degree is a balance between bias (underfitting) and variance (overfitting).
    • High-degree polynomials can fit the training data perfectly but may generalize poorly to unseen data (overfitting).
    • Regularization helps control this trade-off.
  2. Why Regularization?:

    • When fitting polynomials, we often face a dilemma:
      • Low-degree polynomials (e.g., linear or quadratic) may underfit the data.
      • High-degree polynomials (e.g., cubic or higher) may overfit the data.
    • Regularization provides a way to address this by introducing a penalty term.
  3. Penalty Term:

    • Regularization adds a penalty to the loss function.
    • The total loss becomes: Loss = Loss Function + λ · Penalty, where λ controls the regularization strength (a code sketch follows this list).
    • The penalty discourages large coefficients, preventing overfitting.
  4. Types of Regularization:

    • L2 (Ridge) Regularization:
      • Adds the sum of squared coefficients to the loss function.
      • Encourages small coefficients.
      • Helps prevent overfitting.
    • L1 (Lasso) Regularization:
      • Adds the sum of absolute coefficients to the loss function.
      • Encourages sparse models (sets some coefficients to exactly zero).
      • Useful for feature selection.
    • Elastic Net Regularization:
      • Combines L1 and L2 regularization.
      • Balances between sparsity and smoothness.
  5. Effect on Coefficients:

    • Regularization shrinks the coefficients toward zero.
    • Smaller coefficients lead to simpler models.
    • It helps prevent overfitting by reducing the model's complexity.
  6. Continuous Complexity Range:

    • Regularization provides a continuous range of complexity parameters.
    • Unlike choosing a fixed polynomial degree, you can fine-tune the regularization strength.
    • This flexibility allows finding the right balance between bias and variance.
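
A minimal sketch of the three penalty types with scikit-learn, fitted on the toy x and y from the earlier cells; the degree-8 features, alpha=1.0, and l1_ratio=0.5 are illustrative assumptions rather than tuned values.

In [ ]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = x.reshape(-1, 1)

models = {
    'ridge (L2)': Ridge(alpha=1.0),
    'lasso (L1)': Lasso(alpha=1.0),
    'elastic net (L1 + L2)': ElasticNet(alpha=1.0, l1_ratio=0.5),
}

for name, penalized in models.items():
    # Deliberately over-parameterized degree-8 features; the penalty shrinks the
    # coefficients, and the L1 term drives some of them to exactly zero.
    model = make_pipeline(PolynomialFeatures(degree=8), StandardScaler(), penalized)
    model.fit(X, y)
    print(name, np.round(model[-1].coef_, 2))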

Methods to detect overfitting¶

  1. Visual Inspection: Plot the fitted line against the data points.
  2. Cross-Validation: Use techniques like k-fold cross-validation to assess model performance on unseen data (see the sketch after this list).
  3. Learning Curves:
    • Plot the model’s performance (e.g., accuracy or loss) against the size of the training dataset.
    • If the training performance keeps improving while the validation performance plateaus or worsens, overfitting could be occurring.
  4. Feature Importance Analysis: If a few features dominate, it might indicate overfitting.
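
A minimal cross-validation sketch in the spirit of point 2, again using the toy x and y from the earlier cells; the degrees compared and the 5-fold split are illustrative choices, not a recommendation.

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = x.reshape(-1, 1)

# Training error keeps falling as the degree grows, but the cross-validated
# error eventually rises again: that gap is the signature of overfitting.
for degree in (1, 3, 8):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f'degree {degree}: CV MSE = {-scores.mean():.1f}')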

Non-Linear Regression¶

Logistic Regression¶